Sampling & Probability

Sampling

Research Question

Every research project aims to answer a research question (or multiple questions).

Do ECU students who exercise regularly have a higher GPA?

Population

Each research question aims to examine a population.

The population for this research question is ECU students.

Survey Questions

  • Do you exercise at least once a week?
  • What is your GPA?

Sampling

  • It is usually impossible (or impractical) to study the whole population related to a research question.

  • A sample \(n\) is a subset of the population \(N\).

  • The Goal: Select a representative sample to generalize to the broader population.

What is representative?

Convenience Sampling

  • Sample is easy to access.


  • Example:
    • Stand in front of Joyner Library.
    • Give the survey to 100 ECU students.


  • Issue:
    • Will introduce sampling bias

PSA x2

Data quality matters more than data quantity


Many anthropological studies (or similar) are convenience-based.

Simple Random Sample

Every member of a population has an equal chance of being selected.

  • Examples:
    • Reach out to the registrar for student emails
    • Randomly select 100 students
    • Email students the survey

Simple Random Sampling in R

sample(1:100, 3, replace = FALSE)
[1] 36 88 66

To Generalize:

sample(x = 1:N, size = n, replace = FALSE)
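As a minimal sketch with hypothetical values for \(N\) and \(n\) (the seed and sizes here are illustrative, not from the survey example):

```r
set.seed(123)  # illustrative seed for reproducibility

N <- 5000  # hypothetical population size
n <- 100   # desired sample size

ids <- sample(x = 1:N, size = n, replace = FALSE)

length(ids)         # 100 sampled IDs
anyDuplicated(ids)  # 0: with replace = FALSE, no member is selected twice
```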

Systematic Sampling

Similar to a simple random sample, but members are selected at regular intervals from an ordered list of the population.

# 1. Create a population (e.g., a vector of 1 to 1000)
population <- 1:1000

# 2. Define the desired sample size
sample_size <- 100

# 3. Calculate the sampling interval (k)
N <- length(population) # Population size
k <- N / sample_size
# If k is not an integer, you might use ceiling(N/n) and adjust the logic

# 4. Choose a random starting point (r) between 1 and k
set.seed(123) # Optional: for reproducible results
start_point <- sample(1:k, 1)

# 5. Select every k-th element starting from the random start point
systematic_sample_indices <- seq(from = start_point, to = N, by = k)
systematic_sample <- population[systematic_sample_indices]

# 6. View the first few elements and the dimension of the sample
head(systematic_sample)
[1]  3 13 23 33 43 53
length(systematic_sample) # Should be the desired sample size (100)
[1] 100

Subgroups

Stratified Random Sampling in R

library(dplyr)

# Sample data
set.seed(123) # For reproducibility
data <- data.frame(
  ID = 1:100,
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000)
)
# Stratified sampling with sample_n()
sampled_data_n <- data %>%
  group_by(Gender) %>%
  sample_n(10)

# View the sampled data
# sampled_data_n %>% count(Gender)

Single Stage Cluster Sampling in R

set.seed(123)

population <- data.frame(
  Supermarket = paste("Supermarket", 1:1000, sep = "_"),
  CustomerSatisfaction = rnorm(1000, mean = 75, sd = 10)
)

selected_supermarkets <- sample(population$Supermarket, size = 10, replace = FALSE)

sampled_data <- population[population$Supermarket %in% selected_supermarkets, ]

head(sampled_data)
        Supermarket CustomerSatisfaction
203 Supermarket_203             72.34855
225 Supermarket_225             71.36343
255 Supermarket_255             90.98509
354 Supermarket_354             76.16637
457 Supermarket_457             86.10277
554 Supermarket_554             77.49825

2-Stage Cluster Sampling

set.seed(123)

region <- data.frame(
  Neighborhood = paste("Neighborhood", 1:500, sep = "_"),
  AverageIncome = rnorm(500, mean = 50000, sd = 10000)
)

households <- data.frame(
  Neighborhood = rep(sample(region$Neighborhood, size = 500, replace = TRUE), each = 20),
  HouseholdID = rep(1:20, times = 500),
  EmploymentStatus = sample(c("Employed", "Unemployed"), size = 10000, replace = TRUE)
)

selected_neighborhoods <- sample(region$Neighborhood, size = 5, replace = FALSE)

sampled_households <- households[households$Neighborhood %in% selected_neighborhoods, ]

head(sampled_households)
         Neighborhood HouseholdID EmploymentStatus
1981 Neighborhood_302           1       Unemployed
1982 Neighborhood_302           2         Employed
1983 Neighborhood_302           3         Employed
1984 Neighborhood_302           4         Employed
1985 Neighborhood_302           5       Unemployed
1986 Neighborhood_302           6       Unemployed

Multi-Stage Cluster Sampling

set.seed(123)
states <- data.frame(
  State = paste("State", 1:50, sep = "_"),
  Population = sample(1000000:5000000, 50, replace = TRUE)
)
counties <- data.frame(
  State = rep(sample(states$State, size = 50, replace = TRUE), each = 20),
  County = rep(paste("County", 1:20, sep = "_"), times = 50),
  VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)
selected_states <- sample(states$State, size = 3, replace = FALSE)
selected_counties <- sample(counties$County[counties$State %in% selected_states], size = 5, replace = FALSE)
sampled_vaccination_centers <- counties[counties$County %in% selected_counties, ]
head(sampled_vaccination_centers)
      State    County VaccinationRate
8  State_32  County_8        70.37428
11 State_32 County_11        66.86024
13 State_32 County_13        70.81309
15 State_32 County_15        67.68222
19 State_32 County_19        70.91839
28 State_46  County_8        68.84869

# 1. Define Population Distribution (e.g., skewed population)
set.seed(123)
population <- rgamma(100000, shape = 2, scale = 2)

# 2. Take a Sample Distribution (e.g., 100 individuals)
sample_data <- sample(population, 100)

From Sampling to Probability

How do we infer future events or population characteristics?

Random Process

In a random process there is more than one possible outcome.

  • Deterministic prediction of the outcome ranges from difficult to impossible.
sample(x = c("H", "T"), size = 10, replace = T)
 [1] "T" "T" "T" "T" "H" "T" "H" "T" "H" "H"
sample(x = c(1:6), size = 10, replace = T)
 [1] 3 6 5 6 3 5 4 1 5 2

Random vs. Deterministic Processes

Sample Space

The set of all possible outcomes of a random process.

Event

An event is a subset of the sample space.

  • Examples with a 6-sided die:

    • Let A represent the event that a single roll die results in an even number.
      • A = {2, 4, 6}
    • Let B represent the event that a single roll die results in an odd number.
      • B = {1, 3, 5}
    • Let C represent the event that a single roll die results in a prime number.
      • C = {2, 3, 5}

Complement of an Event

The set of all outcomes in the sample space that are not in the event itself.

  • Example:

    • Let C represent the event that a single roll die results in a prime number.
      • C = {2, 3, 5}
    • Notation: \(C^C\)
      • C complement
    • \(C^C\) = {1, 4, 6}
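In R, a complement can be computed with `setdiff()`; a minimal sketch for the die example:

```r
S <- 1:6         # sample space of a single die roll
C <- c(2, 3, 5)  # event C: prime outcomes

setdiff(S, C)    # C complement: outcomes in S that are not in C
[1] 1 4 6
```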

Mutually Exclusive or Disjoint Events

    • Let A represent the event that a single roll die results in an even number.
      • A = {2, 4, 6}
    • Let B represent the event that a single roll die results in an odd number.
      • B = {1, 3, 5}
    • Let C represent the event that a single roll die results in a prime number.
      • C = {2, 3, 5}

Events \(A\) and \(B\) are mutually exclusive because an outcome cannot be both even + odd.

Events \(A\) and \(C\) are not mutually exclusive because the outcome 2 is both even + prime.

Events & Set Notation



Description    Notation       Reading    Elements
Union          \(A \cup C\)   A or C     {2, 3, 4, 5, 6}
Intersection   \(A \cap C\)   A and C    {2}
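Base R's `union()` and `intersect()` reproduce these sets directly, and an empty intersection is a quick check for mutual exclusivity:

```r
A <- c(2, 4, 6)  # even
B <- c(1, 3, 5)  # odd
C <- c(2, 3, 5)  # prime

sort(union(A, C))        # A or C: 2 3 4 5 6
intersect(A, C)          # A and C: 2
length(intersect(A, B))  # 0, so A and B are mutually exclusive
```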

library(ggplot2)

set.seed(1)
OBV <- 1:10

# Draw 100 sample means at each sample size
Dist1  <- replicate(100, mean(sample(OBV, 1,  replace = TRUE)))
Dist9  <- replicate(100, mean(sample(OBV, 9,  replace = TRUE)))
Dist16 <- replicate(100, mean(sample(OBV, 16, replace = TRUE)))
Dist25 <- replicate(100, mean(sample(OBV, 25, replace = TRUE)))
Dist36 <- replicate(100, mean(sample(OBV, 36, replace = TRUE)))

Dist.df <- data.frame(
  Size = factor(rep(c(1, 9, 16, 25, 36), each = 100)),
  Sample_Means = c(Dist1, Dist9, Dist16, Dist25, Dist36)
)

ggplot(Dist.df, aes(Sample_Means, fill = Size)) +
  geom_histogram() +
  facet_grid(. ~ Size)

What is probability?

The likelihood of some event occurring…

How do you view your problem?

  • Option 1: Your answer is based on the frequency of events



  • Option 2: Your answer is based upon your degree of belief in your data AND the system at hand.

Frequentist Probability

The probability of an outcome is defined to be the proportion of times the outcome is observed over a large number of repetitions of the random process.

Assume that we are repeating the random process of a coin flip and are recording \(X\), the number of heads in \(n\) coin flips. Then:

\[ P(H) = \lim_{n\to\infty}\frac{X}{n} \]

\[ P(H) =\frac{1}{2} \]

set.seed(42)  # for reproducibility

# Number of coin flips
n_flips <- 10000

# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# Compute the cumulative proportion of heads
cumulative_mean <- cumsum(flips == "H") / (1:n_flips)

# Plot convergence
plot(1:n_flips, cumulative_mean, type = "l", col = "blue", lwd = 2,
     xlab = "Number of Flips", ylab = "Proportion of Heads",
     main = "Law of Large Numbers: Convergence to 1/2")
abline(h = 1/2, col = "red", lty = 2, lwd = 2)  # Reference line at 1/2

Bayesian Probability

The probability of an outcome is a degree of belief or reasonable expectation quantifying one’s state of knowledge based on observed events and prior knowledge.

set.seed(42)  # For reproducibility

# Number of coin flips
n_flips <- 10000

# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# Prior: Beta(1, 2) (weak prior belief about P(Heads))
alpha <- 1  # prior successes (heads)
beta <- 2   # prior failures (tails)

# Store posterior mean estimates
posterior_means <- numeric(n_flips)

# Bayesian updating
for (i in 1:n_flips) {
  alpha <- alpha + (flips[i] == "H")  # Increment count for heads
  beta <- beta + (flips[i] != "H")    # Increment count for tails
  posterior_means[i] <- alpha / (alpha + beta)  # Compute posterior mean
}

# Plot posterior mean convergence
plot(1:n_flips, posterior_means, type = "l", col = "blue", lwd = 2,
     xlab = "Number of Flips", ylab = "Posterior Mean of P(Heads)",
     main = "Bayesian Law of Large Numbers")
abline(h = 1/2, col = "red", lty = 2, lwd = 2)  # True probability reference line

Axioms of Probability

  1. The probability of an event is between 0 and 1 (non-negativity).

\[ 0 \leq P(A) \leq 1 \]

  2. The probability of the entire sample space \(S\) is 1 (normalization / unitarity).

\[ P(S) = 1 \]

Axioms of Probability

  3. The probability of a union of mutually exclusive events is the sum of their probabilities (countable additivity).

\[ P\left(\bigcup\limits_{i=1}^{\infty} A_{i}\right) = \sum_{i=1}^{\infty} P(A_i) \]

Probability of Complementary Events

\[ P(A) + P(A^C) = 1 \]

If we know that the probability that somebody owns a bike is 0.08, then we know that the probability that somebody does not own a bike is 0.92.
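The bike example is one line of arithmetic in R:

```r
p_bike <- 0.08     # P(owns a bike)
1 - p_bike         # P(does not own a bike) = 0.92
```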

Independence and Multiplication Rule

Two processes are independent if knowing about the outcome of one does not help predict the outcome of the other.

Example:
2 flips of a coin

If events \(A\) and \(B\) are from two independent processes:

\[ P(A \cap B) = P(A) \times P(B) \]

The probability of getting 2 heads in 2 flips:

\[ \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]
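A quick simulation (with an illustrative seed) confirms the multiplication rule empirically:

```r
set.seed(2025)  # illustrative seed

# Simulate 100,000 pairs of independent fair coin flips (2 x 100000 matrix)
pair_flips <- replicate(100000, sample(c("H", "T"), size = 2, replace = TRUE))

# Proportion of pairs that are both heads; should be close to 1/4
mean(colSums(pair_flips == "H") == 2)
```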

Addition Rule

If mutually exclusive \[ P(B\cup C) = P(B) + P(C) \]

If not mutually exclusive \[ P(B\cup C) = P(B) + P(C) - P(B \cap C) \]

Example: What is the probability of a heart or a king?

\[ P(H\cup K) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \]
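Checking the card arithmetic in R:

```r
p_heart <- 13/52
p_king  <- 4/52
p_heart_and_king <- 1/52  # the king of hearts is counted twice otherwise

p_heart + p_king - p_heart_and_king  # 16/52
```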

Example: Data - GSS 2018

The General Social Survey (GSS) is a sociological survey that has been regularly conducted since 1972. It is a comprehensive survey that provides information on experiences of residents of the United States.

                               Belief in Life After Death
                               Yes    No     Total
College Science Course   Yes   375    75     450
                         No    485    115    600
                         Total 860    190    1050

Events

Let \(B\) represent an event that a randomly selected person in this sample believes in life after death.


Let \(C\) represent an event that a randomly selected person in this sample took a college level science course.

Random Variable

A numeric outcome of a random process is called a random variable.

Joint Probability Distribution

Describes the probability of 2 or more random variables occurring.

Note that events \(B\) and \(C\) are not mutually exclusive. A randomly selected person can believe in life after death and might have taken a college science course. \(B \cap C \neq \emptyset\)

\[ P(B \cap C) = \frac{375}{1050} \]

Note that \(P(B\cap C) = P(C\cap B)\). Order does not matter.

Marginal Probability

\(P(B)\) represents a marginal probability. So too do \(P(C)\), \(P(B^C)\), and \(P(C^C)\). To calculate these probabilities, we need only the values in the margins of the contingency table:

\[ P(B) = \frac{860}{1050} \]

\[ P(C) = \frac{450}{1050} \]

Conditional Probability

\(P(B|C)\) represents a conditional probability. So too do \(P(B^C|C)\), \(P(C|B)\), and \(P(C|B^C)\). To calculate these probabilities, we focus on the row or the column of the given information, reducing the sample space to that given information.


Probability that a randomly selected person believes in life after death given that they have taken a college science course:

\[ P(B|C) = \frac{375}{450} \]
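The joint, marginal, and conditional probabilities above can all be read off a matrix of the table's counts; a sketch:

```r
# Counts from the GSS contingency table
tab <- matrix(c(375, 75,
                485, 115),
              nrow = 2, byrow = TRUE,
              dimnames = list(Science = c("Yes", "No"),
                              Belief  = c("Yes", "No")))

n <- sum(tab)                          # 1050 respondents

tab["Yes", "Yes"] / n                  # joint: P(B and C) = 375/1050
sum(tab[, "Yes"]) / n                  # marginal: P(B) = 860/1050
tab["Yes", "Yes"] / sum(tab["Yes", ])  # conditional: P(B | C) = 375/450
```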

Conditioning

  • One is isolating the effects of a specific variable.

  • Order Matters!

    • \(P(\text{like dogs | has a dog}) \neq P(\text{has a dog | like dogs})\)

Connecting data to probability

Likelihood Function

\(P(Y|\theta)\) or \(\mathcal{L}(Y|\theta)\)

  • \(Y\) are observed data.
  • \(\theta\) are parameters of the data (e.g., mean, standard deviation, hypothesis, etc.)


The probability of the observed data, given the hypothesis / parameters.

The likelihood does NOT sum to 1 across hypotheses / parameter values.

SCIENCE!

Most scientific questions stem from the likelihood function.

  • AKA, how does the data match the hypothesis?

Distinction: Downstream interpretations relate to how one constructs (and understands a likelihood).

  • Option A: \(P(Data | \mathbf{Hypothesis})\) - Hypothesis fixed, data varies - Frequentist

  • Option B: \(P(\mathbf{Data} | Hypothesis)\) - Data fixed, hypothesis varies - Bayesian

Quantifying Random Variables

What?

Mathematical functions that define the likelihood of different outcomes of a random variable

  • Remember - they must normalize to 1!

  • Input = Random variable (e.g., Flipping a heads OR individual stature)

  • Output = Probability of that event occurring given the context.

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

Discrete Random Variables

Probability Mass Function (PMF)

Countable number of distinct values

  • Coin flip, die roll, count of students in class, etc.

\[ P(X = x) = f(x) \]

Let \(X\) be the proportion of correct responses on a test with 100 questions. \(S_X = \{0.00, 0.01, 0.02, \ldots, 0.98, 0.99, 1.00\}\).

Random process: 3 flips of a single coin

Sample Space: \(\{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}\)

Let X be a discrete random variable that represents the total number of heads in 3 coin flips.

# 1. Define the possible outcomes (0, 1, 2, or 3 heads)
x <- c(0, 1, 2, 3)

# 2. Define the probabilities for each outcome
# P(X=0) = 1/8, P(X=1) = 3/8, P(X=2) = 3/8, P(X=3) = 1/8
probs <- c(1/8, 3/8, 3/8, 1/8)

# 3. Create the plot
plot(x, probs, 
     type = "h",             # "h" for high-density vertical lines
     lwd = 5,                # Line width
     col = "steelblue",      # Color
     main = "PMF of Total Heads in 3 Coin Flips",
     xlab = "Number of Heads (x)",
     ylab = "P(X = x)",
     ylim = c(0, 0.5),       # Set y-axis limits
     xaxt = "n")             # Turn off default x-axis to customize

# 4. Add custom x-axis and points on top of the lines
axis(1, at = 0:3)
points(x, probs, pch = 16, cex = 2, col = "steelblue")

Continuous Random Variables

Probability Density Function

Describes the relative likelihood of a continuous random variable taking values within its sample space; probabilities are areas under the density over an interval.

The sample space is uncountably infinite.

\[ P(a \leq X \leq b) = \int_a^b f(x)dx \]

\[ \int_{x \in S_X} f(x)\,dx = 1 \]
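R's `integrate()` can verify both identities numerically for a standard normal density (a sketch, not part of the original slides):

```r
# Total area under the standard normal density is 1
integrate(dnorm, lower = -Inf, upper = Inf)$value

# P(-1 <= X <= 1) as an area under the curve...
integrate(dnorm, lower = -1, upper = 1)$value

# ...which matches the CDF difference
pnorm(1) - pnorm(-1)
```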

Cumulative Distribution Functions (cdf)

  • All PMFs / PDFs have a CDF.

  • Mass / density functions return a single probability value based on 1 random variable.

    • It is often useful in our work to get \(P(X \leq x)\).
    • Think p-values or hypothesis testing.
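For the earlier 3-coin-flip example, the CDF is just the running sum of the PMF; `pbinom()` returns the same values directly:

```r
# PMF of X = number of heads in 3 fair flips
probs <- dbinom(0:3, size = 3, prob = 0.5)  # 0.125 0.375 0.375 0.125

# CDF: P(X <= x) accumulates the PMF
cumsum(probs)                               # 0.125 0.500 0.875 1.000

# pbinom() gives the CDF without the manual sum
pbinom(0:3, size = 3, prob = 0.5)
```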

Probability in R

?distributions

  • There are 21 probability distributions in base R.

  • Each function is governed by parameters

    • For example, mean and standard deviation.
  • All have the following construction:

    • d-xxx \(\rightarrow\) PDF/PMF
    • p-xxx \(\rightarrow\) CDF
    • q-xxx \(\rightarrow\) quantile function / inverse cdf
    • r-xxx \(\rightarrow\) random number generator
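A quick tour of the four prefixes for the normal distribution; note that `pnorm()` and `qnorm()` are inverses of each other (seed is illustrative):

```r
dnorm(x = 0, mean = 0, sd = 1)  # PDF: density at 0
pnorm(q = 1.96)                 # CDF: P(X <= 1.96)
qnorm(p = 0.975)                # inverse CDF: roughly 1.96
set.seed(7)                     # illustrative seed
rnorm(n = 3)                    # three random draws

qnorm(pnorm(1.96))              # round trip recovers 1.96
```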

Common Discrete Distributions

  • Bernoulli Distribution - single binary event / trial
    • size is always 1 in a Bernoulli distribution.
dbinom(x = 0:1, size = 1, prob = 0.35)
[1] 0.65 0.35
barplot(names.arg = 0:1, height = dbinom(0:1, size = 1, p = 0.35), xlab = 'X', ylab = 'Probability')

Common Discrete Distributions

  • Binomial Distribution - total # of successes in \(n\) trials.
dist_1 <- dbinom(x = 1:100, size = 100, prob = 0.6)
plot(dist_1, type= "h")

What if you want the probability of getting a 60 or lower?

pbinom(q = 60, size = 100, prob = 0.6,lower.tail = TRUE) # TRUE is default
[1] 0.5379247

What about greater than 60? [Excluding 60]

pbinom(q = 60, size = 100, prob = 0.6,lower.tail = FALSE) # TRUE is default
[1] 0.4620753

Greater than or equal to 60

pbinom(q = 59, size = 100, prob = 0.6,lower.tail = FALSE) # TRUE is default
[1] 0.5432945

For convincing…

probs <- NULL
x <- 1:60

for(i in 1:length(x)){
  
  probs[i] <- dbinom(x = x[i], size = 100, prob = 0.6)
  
}

sum(probs)
[1] 0.5379247

Probabilities and Quantiles

What if I want the probability between 2 areas? Say 60 and 40.

pbinom(q = 60, size = 100, prob = 0.6) - pbinom(q = 39, size = 100, prob = 0.6,lower.tail = TRUE)
[1] 0.5379066

What if I want the central 95% interval?

qbinom(p = c(0.025, 0.975), size = 100, prob = 0.6)
[1] 50 69

library(ggplot2)
library(dplyr)  # for case_when()

# Parameters
n <- 100
p <- 0.6

# Quantiles
q_low  <- qbinom(0.025, size = n, prob = p)
q_high <- qbinom(0.975, size = n, prob = p)

# PMF data frame
x_vals <- 0:n
df <- data.frame(
  x    = x_vals,
  prob = dbinom(x_vals, size = n, prob = p),
  region = case_when(
    x_vals < q_low               ~ "Lower Tail (2.5%)",
    x_vals > q_high              ~ "Upper Tail (2.5%)",
    TRUE                         ~ "Middle 95%"
  )
)

# Plot
ggplot(df, aes(x = x, y = prob, fill = region, color = region)) +
  geom_segment(aes(xend = x, yend = 0), linewidth = 0.8) +
  geom_point(size = 2) +
  geom_vline(xintercept = c(q_low, q_high), linetype = "dashed", color = "grey30") +
  annotate("label", x = q_low,  y = max(dbinom(x_vals, n, p)) * 0.9,
           label = paste0("q = ", q_low, "\n(p = 0.025)"),  size = 3.5, color = "#3b82f6") +
  annotate("label", x = q_high, y = max(dbinom(x_vals, n, p)) * 0.9,
           label = paste0("q = ", q_high, "\n(p = 0.975)"), size = 3.5, color = "#ef4444") +
  scale_color_manual(values = c(
    "Lower Tail (2.5%)" = "#3b82f6",
    "Middle 95%"        = "#22c55e",
    "Upper Tail (2.5%)" = "#ef4444"
  )) +
  scale_fill_manual(values = c(
    "Lower Tail (2.5%)" = "#3b82f6",
    "Middle 95%"        = "#22c55e",
    "Upper Tail (2.5%)" = "#ef4444"
  )) +
  labs(
    title    = "Binomial PMF with 95% Central Region",
    subtitle = paste0("X ~ Binomial(n = 100, p = 0.6)  |  95% interval: [", q_low, ", ", q_high, "]"),
    x        = "x",
    y        = "P(X = x)",
    fill     = NULL,
    color    = NULL
  ) +
  xlim(30, 90) +           # zoom in to relevant range
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")

Random Number Generation

Simulate the number of defective items in 100 batches of 50 items each, where the probability of a single item being defective is 0.02:

rbinom(n = 100, size = 50, prob = 0.02)
  [1] 1 1 2 0 1 1 1 2 1 0 1 2 2 0 1 2 0 0 1 0 0 3 1 2 2 2 0 1 1 0 0 0 0 1 1 1 0
 [38] 2 1 1 2 0 0 0 0 2 0 2 2 0 2 0 1 2 0 2 1 3 1 1 1 2 1 2 0 1 1 0 3 2 0 1 1 1
 [75] 3 1 0 1 1 2 0 0 1 0 0 3 0 0 2 0 4 2 0 1 1 0 2 1 1 0

Common Discrete Distributions

  • Binomial / Bernoulli
  • Poisson
  • Geometric
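Both follow the same d/p/q/r convention; a sketch with illustrative parameter values:

```r
# Poisson: count of events over a fixed interval, given a mean rate (lambda)
dpois(x = 2, lambda = 3)  # P(exactly 2 events when the mean rate is 3)

# Geometric (R's parameterization): number of failures before the first success
dgeom(x = 4, prob = 0.2)  # P(4 failures, then a success) = 0.8^4 * 0.2
```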

Continuous Random Variables - Simulate

library(ggplot2)
library(dplyr)

# ============================================================
# SETUP: Human Height ~ Normal(mean = 170cm, sd = 10cm)
# ============================================================
mu <- 170   # mean height in cm
sigma <- 10 # standard deviation

# ============================================================
# 1. rnorm() — SIMULATE data (like sampling real people)
# ============================================================
# "If I randomly sampled 1000 people, what heights would I see?"

set.seed(42)
sample_heights <- rnorm(n = 1000, mean = mu, sd = sigma)

df_sample <- data.frame(height = sample_heights)

p1 <- ggplot(df_sample, aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "#93c5fd", color = "white") +
  geom_density(color = "#1d4ed8", linewidth = 1) +
  stat_function(fun = dnorm, args = list(mean = mu, sd = sigma),
                color = "#ef4444", linewidth = 1, linetype = "dashed") +
  labs(title = "rnorm() — Simulated Sample (n = 1,000)",
       subtitle = "Blue = observed density | Red dashed = true Normal(170, 10)",
       x = "Height (cm)", y = "Density") +
  theme_minimal(base_size = 13)

Continuous Random Variables - Probabilities

# ============================================================
# 2. dnorm() — DENSITY (height of the curve at a point)
# ============================================================
# "How relatively likely is it to be exactly 170cm vs 190cm?"
# Note: this is NOT a probability — it's a density value.
# For continuous distributions, P(X = x) = 0 exactly.

x_vals <- seq(130, 210, length.out = 500)
df_density <- data.frame(
  x       = x_vals,
  density = dnorm(x_vals, mean = mu, sd = sigma)
)

# Points to highlight
highlight <- data.frame(
  x     = c(170, 180, 190),
  y     = dnorm(c(170, 180, 190), mean = mu, sd = sigma),
  label = c("dnorm(170) = 0.0399",
            "dnorm(180) = 0.0242",
            "dnorm(190) = 0.0054")
)

p2 <- ggplot(df_density, aes(x = x, y = density)) +
  geom_line(color = "#1d4ed8", linewidth = 1.2) +
  geom_point(data = highlight, aes(x = x, y = y), color = "#ef4444", size = 3) +
  geom_label(data = highlight, aes(x = x, y = y, label = label),
             nudge_y = 0.003, size = 3.2) +
  labs(title = "dnorm() — Probability Density Function (PDF)",
       subtitle = "Height of the curve — higher = more 'likely' region, but not a probability",
       x = "Height (cm)", y = "Density") +
  theme_minimal(base_size = 13)

Continuous Random Variables - cdf (1-tail)

# ============================================================
# 3. pnorm() — CDF & P-VALUES
# ============================================================
# "What is the probability of being shorter than X cm?"
# This is where p-values live — they are areas under the curve.

# --- One-tailed: P(Height < 185) ---
cutoff <- 185
p_lower <- pnorm(cutoff, mean = mu, sd = sigma, lower.tail = TRUE)
# "What % of people are shorter than 185cm?"

df_shade_lower <- df_density %>% filter(x <= cutoff)

p3 <- ggplot(df_density, aes(x = x, y = density)) +
  geom_area(data = df_shade_lower, aes(x = x, y = density),
            fill = "#3b82f6", alpha = 0.4) +
  geom_line(color = "#1d4ed8", linewidth = 1.2) +
  geom_vline(xintercept = cutoff, linetype = "dashed", color = "#ef4444") +
  annotate("label", x = 158, y = 0.025,
           label = paste0("P(X < 185) = ", round(p_lower, 4),
                          "\npnorm(185, 170, 10, lower.tail=T)"),
           size = 3.5, color = "#1d4ed8") +
  labs(title = "pnorm() — CDF: P(Height < 185cm)",
       subtitle = "Shaded area = probability = 0.9332 → ~93% of people are shorter than 185cm",
       x = "Height (cm)", y = "Density") +
  theme_minimal(base_size = 13)

Continuous Random Variables - cdf (2-tail)

Continuous Random Variables - Quantiles

# ============================================================
# 4. qnorm() — QUANTILES (inverse of pnorm)
# ============================================================
# "What height separates the top 5% from everyone else?"
# This is the CRITICAL VALUE concept in hypothesis testing.

q_95   <- qnorm(0.95,  mean = mu, sd = sigma)  # one-tailed 5%
q_025  <- qnorm(0.025, mean = mu, sd = sigma)  # two-tailed 2.5% lower
q_975  <- qnorm(0.975, mean = mu, sd = sigma)  # two-tailed 2.5% upper

df_shade_95   <- df_density %>% filter(x >= q_95)
df_shade_low  <- df_density %>% filter(x <= q_025)
df_shade_high <- df_density %>% filter(x >= q_975)

p5 <- ggplot(df_density, aes(x = x, y = density)) +
  geom_area(data = df_shade_low,  aes(x = x, y = density), fill = "#f97316", alpha = 0.5) +
  geom_area(data = df_shade_high, aes(x = x, y = density), fill = "#f97316", alpha = 0.5) +
  geom_line(color = "#1d4ed8", linewidth = 1.2) +
  geom_vline(xintercept = c(q_025, q_975), linetype = "dashed", color = "#f97316") +
  annotate("label", x = q_025, y = 0.015,
           label = paste0("qnorm(0.025)\n= ", round(q_025, 1), "cm"),
           size = 3.2, color = "#7c2d12") +
  annotate("label", x = q_975, y = 0.015,
           label = paste0("qnorm(0.975)\n= ", round(q_975, 1), "cm"),
           size = 3.2, color = "#7c2d12") +
  annotate("text", x = 170, y = 0.022,
           label = "Middle 95%\n(±1.96 SDs)", size = 3.5, color = "#1d4ed8") +
  labs(title = "qnorm() — Quantiles / Critical Values",
       subtitle = "qnorm(0.025) & qnorm(0.975) give the 95% interval — the origin of the ±1.96 rule!",
       x = "Height (cm)", y = "Density") +
  theme_minimal(base_size = 13)

Why Multiply by 2?